home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: comp.lang.c
- Path: cwi.nl!dik
- From: dik@cwi.nl (Dik T. Winter)
- Subject: Re: Matrix Multiplication
- Message-ID: <DLwzv9.9wu@cwi.nl>
- Sender: news@cwi.nl (The Daily Dross)
- Nntp-Posting-Host: chrysant.cwi.nl
- Organization: CWI, Amsterdam
- References: <1996Jan22.110440@gamma.ntu.ac.sg> <4ecjv4$fn2@tamarack.cs.mtu.edu> <822792812snz@genesis.demon.co.uk>
- Date: Sun, 28 Jan 1996 23:28:21 GMT
-
- > >Consider optimizing for cache performance.
- [Omitting the initialization]
- > >for ( i = 0; i < N; i++ )
- > > for ( j = 0; j < N; j++ )
- > > for ( k = 0; k < N; k++ )
- > > c[i][j] = c[i][j] + a[i][k]*b[k][j];
- >
- > Or how about:
- > double sum = 0.0;
- ...
- > c[i][j] = sum;
- > You compiler might be able to perform that optimisation for itself but
- > aliasing issues may make the analysis difficult.
-
- In this case aliasing might make it difficult indeed or even impossible,
- and doing it is indeed an improvement. But the proposed version is *not*
- optimal for caching; the optimal way to do it is to make the 'j' loop
- innermost:
- for ( i = 0; i < N; i++)
- for (k = 0; k < N; k++)
- for (j = 0; j < N; j++)
- c[i][j] = c[i][j] + a[i][k]*b[k][j];
- [and this is also optimal on vector processors].
- The reason is that if at some stage c[i][j] (or b[k][j]) has a cache miss
- a whole cache line is brought in, and this includes some elements of 'c' and
- 'b' that are needed on next steps through the loop. Putting k innermost
- gives the benefit only to 'a'. (But it will also depend a lot on the
- cache replacement strategy used by the processor.)
-
- Timings on an SGI with a 100 MHz MIPS R4k (in seconds with N = 1000, 10
- multiplications):
- i innermost 11.08
- k innermost 5.92
- k innermost + scalar 4.60
- j innermost 2.59
- --
- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924098
- home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
-